Abstract: The co-sparse analysis model for signals assumes
that the signal of interest can be multiplied by an analysis
dictionary $\Omega$, leading to a sparse outcome. This model stands as
an interesting alternative to the more classical synthesis-based
sparse representation model. In this work we propose a theoretical
study of the performance guarantee of the thresholding algorithm
for the pursuit problem in the presence of noise. Our analysis
reveals two significant properties of $\Omega$ that govern the pursuit
performance. The first is the degree of linear dependencies
between sets of rows in $\Omega$, depicted by the co-sparsity level. The
second property, termed the Restricted Orthogonal Projection
Property (ROPP), is the level of independence between such
dependent sets and other rows in $\Omega$. We show how these
dictionary properties are meaningful and useful, both in the
theoretical bounds derived and in a series of experiments that
are shown to align well with the theoretical prediction.

Index Terms: Sparse Representations, Analysis Model,
Thresholding Algorithm, Probability of Success, Linear Dependencies,
Restricted Orthogonal Projection Property (ROPP).



I. Introduction 

Signal models lie at the core of various processing tasks,
such as denoising, solving inverse problems, compression,
interpolation, sampling, and more. One approach that has
become very popular in the past decade is the synthesis-based
sparse representation model. In this model, a signal $x \in \mathbb{R}^d$
is assumed to be composed as a linear combination of a few
atoms (columns) from a dictionary $D \in \mathbb{R}^{d \times n}$ [1], [2],
i.e. $x = D\alpha$. We typically consider a redundant dictionary with
$n > d$. The vector $\alpha \in \mathbb{R}^n$ is the sparse representation of
the signal, i.e. $\|\alpha\|_0 = k \ll d$.

Vast work on the synthesis model during the past decade 
has been invested in an attempt to better understand it, and 
build practical tools for its use. The main activity concentrated 
on problems such as how to perform pursuit of the sparse 
representation from the possibly corrupted signal, deriving 
theoretical success guarantees for such pursuit algorithms, and 
techniques to learn the dictionary D from signal examples. 
Referring specifically to the theoretical success guarantees,
various measures have been suggested over the years to formalize
the notion of the suitability of a synthesis dictionary $D$ for
sparse estimation. These include mutual coherence [3], [4],
the exact recovery condition (ERC) [5], the spark [2], the
restricted isometry property (RIP) [6], [7], the capacity sets
[8], the characteristics for "s-goodness" [9], and others.

Using these measures, theoretical performance guarantees 
were developed for various synthesis pursuit algorithms in 
different setups. For example, the work presented in [10]
provided a coherence-based guarantee on the probability of
success for the thresholding algorithm in a noise-free setup,
under certain assumptions on the representation coefficients.
A later work, [11], suggested coherence-based performance
guarantees for a wide range of pursuit algorithms, including
the thresholding algorithm, in the presence of white Gaussian
random noise. These two contributions are mentioned here
because, like the work reported here, they address the simplest
of all pursuit methods: the thresholding algorithm.

While the synthesis model has been extensively studied,
there is a dual analysis viewpoint to sparse representations
that has only recently started to attract attention [12]-[22].
The analysis model relies on a linear operator (a matrix)
$\Omega \in \mathbb{R}^{p \times d}$, which we will refer to as the analysis
dictionary, and whose rows constitute analysis atoms. The key
property of this model is our expectation that the analysis
representation vector $\Omega x \in \mathbb{R}^p$ should be sparse with $\ell$
zeros. These zeros carve out the low-dimensional subspace that
this signal belongs to. We shall assume that the dimension of
this subspace, which is denoted by $r$, is indeed small, namely
$r \ll d$.

While this description of the analysis model may seem
similar to the synthesis counterpart approach, it is in fact very
different when dealing with a redundant dictionary, $p > d$.
Until recently, relatively little was known about this model,
and little attention has been given to it in the literature,
compared to the synthesis counterpart model. Several recent
works have already started to treat some of the basic research
questions arising from the analysis model, such as how to
perform pursuit with this model [16], [20], [22], what are the
theoretical performance guarantees for the suggested pursuit
algorithms O, QSl, El, EQI, E] and how to learn an
analysis dictionary from a set of signal examples ifTSl . ifTSl .
[19], [22]. We shall return to some of these contributions
towards the end of this paper, and discuss their relation to
our work.

The main goal of this paper is a theoretical study of the
analysis thresholding pursuit algorithm, deriving conditions
for its success in recovering the co-support in the presence of
additive noise. A by-product of this study is an identification
of two complementary measures of goodness that characterize
the analysis dictionary. The first is the degree of linear
dependencies between rows in $\Omega$, which is depicted by the
co-sparsity level. This property has already been noticed and
discussed in previous works on the analysis model [20], [22].
The second property, termed the Restricted Orthogonal Projection
Property (ROPP), is the level of independence between
such dependent sets and other rows taken from the analysis
dictionary. To the best of our knowledge, this is the first time
that this property has been used in the published literature.
In this paper we derive an explicit relation between these
properties and the expected performance of analysis pursuit
by means of thresholding. We demonstrate the validity of
our theoretical findings by comparing them with empirical
performance results.

This paper is organized as follows: In Section II we present
the core concept of the analysis-based model, characterize the
signals that belong to it, and discuss the notion of linear
dependencies within the rows of the analysis dictionary. In Section
III we present the analysis pursuit problem of denoising a
signal using the analysis model and suggest the thresholding
algorithm for solving this problem. We test the performance
of this algorithm in a series of synthetic experiments for
different types of analysis dictionaries. A theoretical study
of the performance of the analysis thresholding algorithm is
conducted in Section IV. We begin by developing theoretical
success guarantees for the thresholding algorithm and discuss
the dictionary properties arising from this theoretical analysis.
Then we revisit the empirical results in light of the developed
theoretical guarantees. Section V discusses the relation of this
work to existing contributions, and Section VI concludes this
paper.

II. The Analysis Model and its Dictionary 

A. Basic Properties of the Analysis Model 

This section begins with a brief review of the analysis-based
model. The analysis model for the signal $x \in \mathbb{R}^d$
uses the possibly redundant analysis dictionary $\Omega \in \mathbb{R}^{p \times d}$,
where redundancy here implies $p > d$. Throughout this paper
the $j$th row in $\Omega$ will be denoted by $w_j^T$. A fundamental
property of this model is the assumption that the analysis
representation vector $\Omega x$ should be sparse. In this work
we consider specifically $\ell_0$ sparsity, which implies that $\Omega x$
contains many zeros. The co-sparsity $\ell$ of the analysis model
is defined as the number of zeros in the vector $\Omega x$:

$$\|\Omega x\|_0 = p - \ell. \tag{1}$$



In this model we put an emphasis on the zeros of $\Omega x$, and
define the co-support $\Lambda$ of $x$ as the set of $\ell = |\Lambda|$ rows that
are orthogonal to it. In other words, $\Omega_\Lambda x = 0$, where $\Omega_\Lambda$ is
a submatrix of $\Omega$ that contains only the rows indexed in $\Lambda$.
We also define the co-rank of a signal $x$ with co-support $\Lambda$
as the rank of $\Omega_\Lambda$. The signal $x$ is thus characterized by its
co-support, which determines the subspace it is orthogonal to,
and consequently the complement space to which it belongs.
Just like in the synthesis model, we assume that the dimension
of the subspace the signal belongs to, denoted by $r$, is small,
namely $r \ll d$. The co-rank of such an analysis signal is $d - r$.

How sparse can the analysis representation vector be? The
answer to this question is directly related to the existence of
linear dependencies within the rows of the analysis dictionary.
This will become clearer in the next subsection, where we
discuss in detail the effect of having such dependencies on the
possible co-sparsity levels.

B. Linear Dependencies in the Analysis Dictionary 

To motivate our discussion on the advantage of having linear
dependencies within the rows of the analysis dictionary, let
us first assume that the rows in $\Omega$ are in general position,
implying that every subset of $d$ or fewer rows is necessarily
linearly independent. This is equivalent to the claim that the
spark of $\Omega$ is full. Naturally, for this case, $\ell < d$, since
otherwise there would be $d$ independent rows orthogonal to
$x$, implying $x = 0$. Thus, in this case the analysis model
leads necessarily to a mild sparsity, $\|\Omega x\|_0 > p - d$, and
for a highly redundant analysis operator, the cardinality of the
analysis representation vector $\Omega x$ is expected to be quite high.
In this case, the dimension of the subspace the signal belongs
to is $r = d - \ell$. An example for such a dictionary is a Gaussian
random one, denoted $\Omega_{RAND}$, where the rows are drawn
identically and independently from a normal distribution.

A more interesting case is when $\Omega$ has non-full spark,
implying that linear dependencies exist between the dictionary
atoms. The immediate implication is that $\ell$ could go beyond $d$,
and yet the signal would not necessarily be nulled. An example
of such a dictionary is the set of cyclic horizontal and vertical
one-sided derivatives, applied on a 2D signal of size
$\sqrt{d} \times \sqrt{d}$. The corresponding analysis dictionary, denoted
$\Omega_{DIF}$, is of size $2d \times d$, thus twice redundant. This dictionary
was discussed in detail in [20], showing that its rows exhibit strong
linear dependencies.
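
As a concrete illustration of such a dictionary, the following sketch builds cyclic forward differences on a $\sqrt{d} \times \sqrt{d}$ grid. The function name `cyclic_difference_dictionary`, the row-major pixel ordering and the sign convention are assumptions made for illustration only, and the rows are left unnormalized.

```python
import numpy as np

def cyclic_difference_dictionary(n):
    """Stack cyclic horizontal and vertical forward differences on an n-by-n grid.

    Returns a (2*n*n, n*n) matrix acting on the row-major flattened signal.
    """
    d = n * n
    identity = np.eye(d)
    idx = np.arange(d).reshape(n, n)
    right = np.roll(idx, -1, axis=1).ravel()  # index of the cyclic right neighbour
    below = np.roll(idx, -1, axis=0).ravel()  # index of the cyclic lower neighbour
    horizontal = identity[right] - identity   # row k: x[right of k] - x[k]
    vertical = identity[below] - identity     # row k: x[below k] - x[k]
    return np.vstack([horizontal, vertical])

omega_dif = cyclic_difference_dictionary(3)   # p = 18, d = 9
print(omega_dif.shape)
print(np.linalg.matrix_rank(omega_dif))
# The three horizontal differences along the first image row sum to zero,
# so a set of only three rows is already linearly dependent.
print(np.abs(omega_dif[:3].sum(axis=0)).max())
```

In this construction every row annihilates the constant signal, so the dictionary cannot have full column rank, which is one simple way to see that its rows are strongly dependent.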

Note that if we perform right multiplication of an analysis
dictionary $B$ by an invertible square matrix $A$, then the
resulting analysis dictionary $\Omega = BA$ exhibits the same
linear dependencies between its rows as $B$. To see that
this is indeed true, let $\Lambda \subseteq \{1, \ldots, p\}$ and suppose that
there exists a non-zero vector $\gamma \in \mathbb{R}^{|\Lambda|}$ such that
$\gamma^T B_\Lambda = 0$, namely the rows of $B_\Lambda$ are linearly dependent.
Then $\gamma$ also satisfies $\gamma^T \Omega_\Lambda = \gamma^T B_\Lambda A = 0$. Conversely,
since $A$ is invertible, $\gamma^T \Omega_\Lambda = 0$ implies
$\gamma^T B_\Lambda = \gamma^T \Omega_\Lambda A^{-1} = 0$. For example, the rows of the analysis
dictionary that is generated as $\Omega_{MIX} = \Omega_{DIF} A$, where $A$ is
a square matrix consisting of $d$ Gaussian random rows, exhibit
the same linear dependencies as $\Omega_{DIF}$.
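
The invariance argument can be checked numerically on a small example. The sketch below compares the rank of every $k$-row subset before and after right multiplication; the helper name `same_row_dependencies` and the toy matrix `B` are hypothetical, and ranks are computed with NumPy's default tolerance.

```python
import numpy as np
from itertools import combinations

def same_row_dependencies(B, A, k):
    """True if every k-row subset has the same rank in B and in B @ A."""
    omega = B @ A
    for subset in combinations(range(B.shape[0]), k):
        rows = list(subset)
        if np.linalg.matrix_rank(B[rows]) != np.linalg.matrix_rank(omega[rows]):
            return False
    return True

rng = np.random.default_rng(0)
# Toy dictionary: rows 0-2 sum to zero, row 3 is independent of any two of them.
B = np.array([[1., -1., 0.], [0., 1., -1.], [-1., 0., 1.], [1., 1., 1.]])
A = rng.standard_normal((3, 3))  # Gaussian matrix, invertible with probability one
print(all(same_row_dependencies(B, A, k) for k in (1, 2, 3)))
```

An exhaustive subset check like this is only practical for small $p$, which is consistent with the low-dimensional dictionaries used in this paper.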

Fig. 1 shows the three types of dictionaries mentioned
above for $p = 18$, $d = 9$. Throughout this paper we will
experiment with these three dictionaries. The reason for such
low-dimensional matrices is the fact that the study of the
properties of the analysis dictionary will require exhaustive
computations over all possible $2^p$ co-supports. In particular,
these dictionary properties will appear in the performance
guarantees we are about to derive for the analysis thresholding
algorithm (see Section IV-A). Towards the end of this paper
we will replace the exact dictionary properties by approximate 
ones, which are obtained from a set of signal examples 
generated from the dictionary. This will allow us to show 
theoretical results also for higher dimensions and check how 
well they predict the empirical results (see the end of Section 






[Figure 3 plot: curves for $\Omega_{DIF}$, $\Omega_{RAND}$ and $\Omega_{MIX}$; horizontal axis: number of rows.]



Figure 1. Three types of analysis dictionaries of size $18 \times 9$: Left - $\Omega_{DIF}$,
Middle - $\Omega_{RAND}$, Right - $\Omega_{MIX}$. Each dictionary atom is displayed as a
2D patch of size 3-by-3.



Figure 3. The signatures for three types of analysis dictionary of size $18 \times 9$
that were shown in Fig. 1. As can be seen, both $\Omega_{DIF}$ and $\Omega_{MIX}$ have the
same signature, which is strictly lower than 1 for $k \ge 3$. Therefore the spark
of these dictionaries is 3, namely it is non-full. For $\Omega_{RAND}$, however, the
signature equals 1 for all $k = 1, \ldots, 9$ and therefore its spark is $d + 1 = 10$.

Figure 2. The effective co-sparsities corresponding to each type of analysis
dictionary of size $18 \times 9$: Top - $\Omega_{DIF}$, Middle - $\Omega_{RAND}$, Bottom - $\Omega_{MIX}$.
For each type we show the exact co-sparsity distribution, which is computed
exhaustively for all possible co-supports corresponding to a co-rank of 7.
We also show an empirical normalized histogram, which is computed from
10,000 analysis signals of co-rank 7 that were generated using the process
described in the beginning of Section III-C. The reference value of $\ell = 7$ is
indicated by the vertical dotted line. As can be seen, the effective co-sparsities
are all strictly higher for both $\Omega_{DIF}$ and $\Omega_{MIX}$.



As mentioned above, when the rows in $\Omega$ are not in general
position, the co-sparsity $\ell$ can be greater than $d$. This behavior
is demonstrated in Fig. 2, showing the distributions of $\ell$ for the
three types of $\Omega$ shown in Fig. 1 and co-rank 7. For each type
the exact co-sparsity distribution is computed exhaustively
for all possible co-supports corresponding to a co-rank of
7. We also show an empirical normalized histogram, which
is computed from 10,000 analysis signals of co-rank 7 that
are generated using the process that will be described in the
beginning of Section III-C. As can be seen, the distributions
for $\Omega_{DIF}$ and $\Omega_{MIX}$ coincide, as should be expected from
the observation mentioned above (both dictionaries exhibit
the same linear dependencies between their rows). In both
cases, though the signals have a fixed co-rank 7, their actual
co-sparsities are much higher, varying in the range 8 to 14.
Interestingly, odd co-sparsity values cannot lead to the chosen
co-rank, as indeed seen in Fig. 2. Thus, we see that by allowing
linear dependencies between the rows in $\Omega$, co-sparsities much
higher than the signal dimension $d$ can be achieved.

An alternative measure for the linear dependencies between
sets of rows in $\Omega$ is the signature of the analysis dictionary,
which is defined as the fraction of linearly independent sets
of $k$ rows out of all possible sets of size $k$; this ratio is
denoted by $f(k)$ [23]. Since every set of size at least $d + 1$
is necessarily linearly dependent, it is sufficient to compute
the ratios mentioned above for $k = 1, \ldots, d$. The spark of
$\Omega$ can be readily computed from the signature $f(k)$: it is
the smallest index $k$ such that $f(k) < 1$. The signatures of
the three analysis dictionaries that were shown in Fig. 1 are
depicted in Fig. 3. Clearly, $\Omega_{DIF}$ and $\Omega_{MIX}$ have the same
signature, as they exhibit the same linear dependencies. Their
signature is much lower than for $\Omega_{RAND}$, whose signature
equals 1 for all $k = 1, \ldots, d$. We observe that the spark
of $\Omega_{DIF}$ and $\Omega_{MIX}$ is 3, whereas the spark of $\Omega_{RAND}$ is
$d + 1 = 10$ (i.e. the spark is full). To conclude this section,
note that a lower dictionary signature indicates that there are
more linear dependencies within its rows, and these allow for
larger co-sparsity levels.
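
The signature and the spark can be computed by brute force for small dictionaries. The following is a minimal sketch under that assumption; the function names `signature` and `spark` and the toy matrix are illustrative, and the cost grows combinatorially with $p$.

```python
import numpy as np
from itertools import combinations
from math import comb

def signature(omega, k):
    """Fraction of k-row subsets of omega that are linearly independent."""
    p = omega.shape[0]
    independent = sum(
        np.linalg.matrix_rank(omega[list(s)]) == k
        for s in combinations(range(p), k)
    )
    return independent / comb(p, k)

def spark(omega):
    """Smallest k with signature below 1; d + 1 when no such k <= d exists."""
    d = omega.shape[1]
    for k in range(1, d + 1):
        if signature(omega, k) < 1:
            return k
    return d + 1

# Toy dictionary: rows 0-2 sum to zero, so exactly one of the four triples is dependent.
toy = np.array([[1., -1., 0.], [0., 1., -1.], [-1., 0., 1.], [1., 1., 1.]])
print(signature(toy, 3), spark(toy))
```

For the toy matrix, three of the four triples are independent, so the signature at $k = 3$ drops below one and the spark equals 3.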



III. Analysis Thresholding

A. Analysis Pursuit

In this paper we assume that $x$ is a co-sparse analysis signal
with co-rank $d - r$, and this signal is contaminated by additive
noise, $y = x + e$. Starting with the oracle setup, where the true
co-support $\Lambda$ is known, we can simply recover $x$ by projecting
$y$ onto the subspace orthogonal to $\Omega_\Lambda$:

$$\hat{x} = \left(I - \Omega_\Lambda^\dagger \Omega_\Lambda\right) y. \tag{2}$$

Assuming a deterministic signal $x$ residing in an $r$-dimensional
analysis subspace and white zero-mean Gaussian noise $e$
with variance $\sigma^2$, the mean denoising error in the oracle setup
is given by

$$E\|x - \hat{x}\|_2^2 = \operatorname{tr}\left(I - \Omega_\Lambda^\dagger \Omega_\Lambda\right)\sigma^2 = r\sigma^2, \tag{3}$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. For more details see

In the general case the correct co-support is unknown and it
should be estimated from $y$. Recovering the noise-free signal
$x$ requires solving a problem of the form

$$\{\hat{x}, \hat{\Lambda}\} = \arg\min_{x, \Lambda} \|x - y\|_2 \quad \text{Subject To} \quad \Omega_\Lambda x = 0, \;\; \operatorname{Rank}(\Omega_\Lambda) = d - r. \tag{4}$$

We refer to this problem as analysis sparse-coding or
analysis pursuit. This problem can be readily reformulated as
a two-step recovery process. To eliminate the dependency on
$x$ we can place the oracle formula (2) into problem (4).
We get that recovering the co-support $\Lambda$ amounts to solving
the problem

$$\hat{\Lambda} = \arg\min_{\Lambda} \|\Omega_\Lambda^\dagger \Omega_\Lambda y\|_2 \quad \text{Subject To} \quad \operatorname{Rank}(\Omega_\Lambda) = d - r. \tag{5}$$

Once the co-support has been recovered we can project $y$ onto
the orthogonal subspace (using (2)), just as in the oracle setup.
Similar to the synthesis sparse approximation problem, the
problem posed in Eq. (5) is combinatorial in nature and
can thus only be approximated in general. One approach for
approximating the solution is to use a relaxed $\ell_1$ penalty
function on the coefficients $\Omega x$, producing

$$\hat{x} = \arg\min_{x} \|x - y\|_2 \quad \text{Subject To} \quad \|\Omega x\|_1 \le T. \tag{6}$$

This approach is parallel to the basis-pursuit approach for
synthesis approximation [24]. A second approach parallels the
synthesis greedy pursuit algorithms [25], [26] and suggests
selecting rows from $\Omega$ one-by-one in a greedy fashion. The
solution can be built by either detecting the rows that correspond
to the non-zeros in $\Omega x$, or by detecting the zeros. The
GAP algorithm, described in [20], aims at detecting the non-zeros,
whereas the BG and OBG algorithms developed in
detect the zeros.

B. The Thresholding Algorithm



In this work we will take the alternative (and simpler)
approach of thresholding. This algorithm computes the analysis
representation $\Omega y$ and chooses the entries that are smallest
in magnitude as the estimated co-support. Thresholding will
always obtain a perfect recovery of the co-support in noise-free
setups since $\Omega_\Lambda x = 0$ and $|w_j^T x| > 0$ for all $j \in \Lambda^c$. We suggest
using it also in the presence of noise. A detailed description of the
analysis thresholding algorithm is given in Algorithm 1.



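Algorithm 1 Analysis Thresholding Algorithm

1: Input: Analysis dictionary $\Omega \in \mathbb{R}^{p \times d}$, signal $y \in \mathbb{R}^d$, and
   target co-rank $d - r$
2: Output: Signal $\hat{x} \in \mathbb{R}^d$ with co-rank $d - r$ approximating
   the minimization of $\|y - \hat{x}\|_2$ and its co-support $\hat{\Lambda}$
3: Inner Products: $z_k := |w_k^T y|$, $\forall k = 1, \ldots, p$
4: Sort: Set $\Gamma$ to be the index set $\{1, \ldots, p\}$ sorted by the
   value of $z_k$ in increasing order
5: Initialization: Set $i = 0$, $\hat{\Lambda} := \emptyset$
6: while $\operatorname{Rank}(\Omega_{\hat{\Lambda}}) < d - r$ do
7:    $i := i + 1$
8:    Update Co-Support: $\hat{\Lambda} := \hat{\Lambda} \cup \{\Gamma_i\}$
9: end while
10: Project: $\hat{x} = \left(I - \Omega_{\hat{\Lambda}}^\dagger \Omega_{\hat{\Lambda}}\right) y$
11: Refine Co-Support: $\hat{\Lambda} = \{k \mid 1 \le k \le p, \; |w_k^T \hat{x}| < \varepsilon_0\}$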



The process begins by computing the inner products between
all the rows in $\Omega$ and the signal $y$ and sorting the
index set $\{1, \ldots, p\}$ according to the magnitudes of these
inner products in increasing order, resulting in a new index
set $\Gamma$. The co-support is initialized to be an empty set. We
then accumulate rows into the co-support, in a row-by-row
fashion, according to their order of appearance in the set
$\Gamma$. This process repeats until the target co-rank is achieved,
namely $\operatorname{Rank}(\Omega_{\hat{\Lambda}}) = d - r$. The solution $\hat{x}$ is then computed
by projecting $y$ onto the subspace orthogonal to the selected
rows. Finally, the co-support is refined by recalculating the
representation vector $\Omega \hat{x}$ and finding the additional coefficients
that fall below some small threshold $\varepsilon_0$. This can reveal
additional rows that are orthogonal to the signal estimate,
namely the rows that are spanned by the existing set of rows
$\Omega_{\hat{\Lambda}}$. Despite the fact that the last step ("Refine Co-Support")
has no impact on the signal recovery, it is still significant for
our purposes, as our study checks the correctness of the found
co-support.
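
The steps of Algorithm 1 can be sketched directly in NumPy. This is a straightforward, unoptimized illustration rather than the authors' implementation: the function name `analysis_thresholding`, the default value of `eps0` and the use of `matrix_rank` and `pinv` are assumptions.

```python
import numpy as np

def analysis_thresholding(omega, y, corank, eps0=1e-8):
    """Sketch of Algorithm 1: returns the estimate x_hat and its refined co-support."""
    order = np.argsort(np.abs(omega @ y))        # rows sorted by increasing |w_k^T y|
    chosen = []
    for k in order:                              # accumulate rows until the target co-rank
        chosen.append(k)
        if np.linalg.matrix_rank(omega[chosen]) >= corank:
            break
    sub = omega[chosen]
    x_hat = y - np.linalg.pinv(sub) @ (sub @ y)  # project onto the orthogonal complement
    cosupport = np.flatnonzero(np.abs(omega @ x_hat) < eps0)  # refine the co-support
    return x_hat, cosupport

# Noise-free usage example with a random dictionary (unit-norm rows).
rng = np.random.default_rng(0)
omega = rng.standard_normal((18, 9))
omega /= np.linalg.norm(omega, axis=1, keepdims=True)
true_rows = [0, 3, 5, 7, 11, 13, 17]
sub = omega[true_rows]
x = rng.standard_normal(9)
x -= np.linalg.pinv(sub) @ (sub @ x)             # co-sparse signal with co-rank 7
x_hat, cosupport = analysis_thresholding(omega, x, corank=7)
print(cosupport)
```

Recomputing the rank at every iteration is wasteful; it is kept here only because it mirrors the while-loop condition of the algorithm.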

In practice, the above algorithm can be implemented efficiently
by accumulating an orthogonalized set of the co-support
rows using a modified Gram-Schmidt process. This
process is applied according to the order of appearance in the
set $\Gamma$. Denoting by $\{q_j\}_{j=1}^{J}$ the orthogonal set accumulated
so far (as column vectors), the orthogonalization of a new row
$w_{\Gamma_i}$ is obtained by

$$q_{J+1} = w_{\Gamma_i} - \sum_{j=1}^{J} \left(q_j^T w_{\Gamma_i}\right) q_j. \tag{7}$$

If $q_{J+1}$ equals zero, it is not added to the orthogonal set, as it is
already spanned by the existing one. Otherwise, this vector is
normalized, $q_{J+1} := q_{J+1} / \|q_{J+1}\|_2$.

The above-described orthogonalization process allows us,
first of all, to avoid the computation of the rank of the submatrix
$\Omega_{\hat{\Lambda}}$, since the number of vectors in the orthogonalized set
($J$) equals that rank. Secondly, the orthogonalized set
$\{q_j\}_{j=1}^{J}$ can also be used to avoid the matrix inversion in the
"Project" step, which translates to

$$\hat{x} = \Big(I - \sum_{j=1}^{J} q_j q_j^T\Big) y. \tag{8}$$
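
The orthogonalization-based variant described above might look as follows. This is a sketch under stated assumptions: the name `thresholding_gram_schmidt` is hypothetical, `tol` is an assumed numerical threshold standing in for the exact "equals zero" test, and the final co-support refinement step is omitted.

```python
import numpy as np

def thresholding_gram_schmidt(omega, y, corank, tol=1e-10):
    """Thresholding with an orthonormal basis built row by row (no rank or pinv calls)."""
    order = np.argsort(np.abs(omega @ y))
    basis = []                                   # orthonormal vectors q_1, ..., q_J
    for k in order:
        q = omega[k].astype(float)               # astype returns a copy of the row
        for qj in basis:                         # remove the components along earlier q_j
            q -= (qj @ q) * qj
        norm = np.linalg.norm(q)
        if norm > tol:                           # skip rows already spanned by the basis
            basis.append(q / norm)
        if len(basis) == corank:
            break
    x_hat = y.astype(float)
    for qj in basis:                             # x_hat = (I - sum_j q_j q_j^T) y
        x_hat -= (qj @ x_hat) * qj
    return x_hat, len(basis)

rng = np.random.default_rng(1)
omega = rng.standard_normal((12, 6))
y = rng.standard_normal(6)
x_hat, rank = thresholding_gram_schmidt(omega, y, corank=4)
print(rank, np.linalg.norm(x_hat))
```

Because the accumulated vectors are orthonormal, subtracting their components one at a time gives the same result as the projection in the "Project" step, without forming a pseudo-inverse.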



C. Synthetic Experiments 

We now demonstrate how the thresholding algorithm (see
Algorithm 1) performs through a series of synthetic experiments.
Throughout this subsection we shall assume that
the analysis signals are generated by the following process:
Choose randomly a set of row indices $\Lambda \subseteq \{1, \ldots, p\}$, which
will be the signal's co-support. Starting with a random vector
$u$, whose entries are drawn independently
and identically from a zero-mean Gaussian distribution with
variance $\sigma_u^2$, project it onto the subspace orthogonal to $\Omega_\Lambda$:

$$x = \left(I - \Omega_\Lambda^\dagger \Omega_\Lambda\right) u, \tag{9}$$

and $x$ is an analysis signal that satisfies our co-sparsity
assumption. For a general-positioned $\Omega$ we choose exactly
$\ell$ rows from $\Omega$ at random. Otherwise we choose $d - r$
linearly independent rows from $\Omega$. Once a signal $x$ has
been generated, its analysis representation $\Omega x$ is re-computed,
possibly revealing additional rows that are orthogonal to this
signal, due to linear dependence on the chosen subset $\Lambda$.
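
The generation process above can be sketched as follows. The greedy way of picking linearly independent rows from a random permutation, the name `generate_analysis_signal` and the tolerance `tol` are assumptions made for illustration; the paper does not specify how the independent rows are drawn.

```python
import numpy as np

def generate_analysis_signal(omega, corank, sigma_u=1.0, rng=None, tol=1e-8):
    """Draw a co-sparse signal: pick `corank` independent rows, project a Gaussian vector."""
    rng = np.random.default_rng() if rng is None else rng
    p, d = omega.shape
    chosen = []
    for k in rng.permutation(p):                 # visit the rows in random order
        trial = chosen + [k]
        if np.linalg.matrix_rank(omega[trial]) == len(trial):
            chosen = trial                       # keep the row only if it adds rank
        if len(chosen) == corank:
            break
    sub = omega[chosen]
    u = sigma_u * rng.standard_normal(d)
    x = u - np.linalg.pinv(sub) @ (sub @ u)      # x = (I - pinv(sub) sub) u
    cosupport = np.flatnonzero(np.abs(omega @ x) < tol)  # re-computed co-support
    return x, cosupport

rng = np.random.default_rng(0)
omega = rng.standard_normal((18, 9))
x, cosupport = generate_analysis_signal(omega, corank=7, rng=rng)
print(len(cosupport))
```

For a dictionary in general position the re-computed co-support is expected to contain only the chosen rows, whereas for a dictionary with dependent rows it can be larger than the co-rank.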

We generate $N = 10{,}000$ analysis signals in $\mathbb{R}^9$ residing
in 2-dimensional subspaces for the three types of analysis
dictionaries shown in Fig. 1; normalized histograms of their
effective co-sparsities are depicted in Fig. 2. These signals are
contaminated with additive white Gaussian noise at different
noise levels $\sigma$, resulting in a set of noisy signals $\{y_j\}_{j=1}^{N}$
for each dictionary type and noise level. The thresholding
algorithm is then applied on these signals with a target co-rank
of $d - r = 7$. Results are shown in Fig. 4 for various
signal-to-noise ratios (SNR) in the range 6dB to 74dB. Each
SNR level is related to the ratio $\sigma/\sigma_u$ by

$$\text{SNR} = 10\log_{10}\left(\frac{E\|x\|_2^2}{E\|y - x\|_2^2}\right) = -20\log_{10}\left(\sqrt{\frac{d}{r}}\,\frac{\sigma}{\sigma_u}\right), \tag{10}$$

where in the last equality we used $E\|x\|_2^2 = \operatorname{tr}\left(I - \Omega_\Lambda^\dagger \Omega_\Lambda\right)\sigma_u^2 = r\sigma_u^2$,
which holds since $x$ is a zero-mean Gaussian vector with
covariance matrix $\left(I - \Omega_\Lambda^\dagger \Omega_\Lambda\right)\sigma_u^2$
(exhibiting a similar form as in the oracle error, see Eq. (3)),
and $E\|y - x\|_2^2 = d\sigma^2$. At this point we should mention that the
SNR levels shown on the right part of the figure are very high
(for example, SNR = 60dB means that the signal energy
is $10^6$ times the noise energy, i.e. an amplitude ratio of 1000).
Setups with such high SNR levels can be considered as almost
noise-free. Therefore we expect that the thresholding algorithm
will obtain a perfect recovery of the co-support in these setups,
just like in the noise-free setup.
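
The relation between the SNR and the noise level can be made concrete with a short calculation. The sketch below uses only the expectations $E\|x\|_2^2 = r\sigma_u^2$ and $E\|y - x\|_2^2 = d\sigma^2$; the function names and the symbol `sigma_u` for the standard deviation of the generating vector $u$ are assumptions.

```python
import math

def snr_db(sigma, sigma_u, r, d):
    """Expected SNR in dB, using E||x||^2 = r*sigma_u^2 and E||y-x||^2 = d*sigma^2."""
    return 10 * math.log10((r * sigma_u**2) / (d * sigma**2))

def noise_std_for_snr(target_db, sigma_u, r, d):
    """Noise standard deviation sigma that yields the target SNR."""
    return sigma_u * math.sqrt(r / d) * 10 ** (-target_db / 20)

sigma = noise_std_for_snr(60.0, sigma_u=1.0, r=2, d=9)
print(sigma)
print(snr_db(sigma, 1.0, 2, 9))   # recovers 60 dB up to rounding
print(10 ** (60.0 / 10))          # energy ratio corresponding to 60 dB
```

At 60 dB the energy ratio is $10^6$, which corresponds to an amplitude ratio of 1000.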

In Fig. 4 we can see on the top the empirical probability of
success for the thresholding algorithm on each of the dictionaries.
Note that "success" refers here to an exact recovery of
the true co-support. On the bottom we can see the denoising
performance, measured as the average SNR improvement
(ISNR):

$$\text{ISNR} = -10\log_{10}\left(\frac{\|\hat{x} - x\|_2^2}{\|y - x\|_2^2}\right). \tag{11}$$

These are also compared with the oracle performance, which
corresponds to an ISNR of $-10\log_{10}\left(\frac{r\sigma^2}{d\sigma^2}\right) = 6.53$dB. We
can see at the top right corner of the figure that thresholding
succeeds with probability one for all three types of dictionaries,
which aligns with our expectations for high SNRs that
were mentioned before.
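
The ISNR measure and the quoted oracle value can be reproduced with a few lines. The function names below are illustrative; the oracle figure simply substitutes the expected oracle error $r\sigma^2$ and the expected noise energy $d\sigma^2$ with $r = 2$ and $d = 9$.

```python
import numpy as np

def isnr_db(x, y, x_hat):
    """SNR improvement in dB of the estimate x_hat over the noisy signal y."""
    return -10 * np.log10(np.sum((x_hat - x) ** 2) / np.sum((y - x) ** 2))

def oracle_isnr_db(r, d):
    """ISNR obtained when the error energy is r*sigma^2 and the noise energy is d*sigma^2."""
    return -10 * np.log10(r / d)

print(round(oracle_isnr_db(2, 9), 2))  # 6.53
```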

Several important observations can be drawn from the
results shown in Fig. 4. First of all, we can see that the
probability of success decreases as the SNR deteriorates. This
aligns with the simple intuition that the higher the noise, the
higher the chance of any pursuit algorithm to make mistakes in
the co-support detection. Second, the highest success ratio and
ISNR are obtained for $\Omega_{DIF}$ at all noise levels; the second-best
results relate to $\Omega_{MIX}$ and the worst to $\Omega_{RAND}$.

The observation that $\Omega_{RAND}$ exhibits the worst performance
does not come as a surprise. The fact that having
many linear dependencies in an analysis dictionary $\Omega$ leads
to better denoising results has already been observed in a
previous work [22]. However, the performance gap between
$\Omega_{DIF}$ and $\Omega_{MIX}$ is not obvious at all, if we recall that both
exhibit the same linear dependencies between their rows (and
hence the same co-sparsity distribution). This calls for a deeper
theoretical study of the thresholding algorithm, which is the
topic of the next section.

IV. Theoretical Study of Analysis Thresholding 

This section presents the main contribution of this paper:
a theoretical analysis of the capability of the thresholding
algorithm to recover the true analysis co-support in the presence
of additive noise, and the implications of this analysis.
We start in Section IV-A with the derivation of our main result,
a lower bound on the probability of successfully recovering
the co-support by the analysis thresholding algorithm. Section
IV-B discusses the obtained results and specifically the meaning
of the measures proposed for the analysis dictionary. In
Section IV-C we revisit these results in an attempt to explain
them further, and contrast them with the empirical evidence
we have just created. As this work focuses on the probability
of the analysis thresholding algorithm to recover the exact
co-support, the relative denoising performance will not be further
explored in this paper and remains a topic for future research.

Figure 4. Denoising experiments with analysis signals of co-rank 7 created from the three types of analysis dictionaries of size $18 \times 9$ that were shown in
Fig. 1. Additive white noise is added to each of these signals for varying noise levels and then the thresholding algorithm (see Algorithm 1) is applied on
each signal to obtain a recovery of its co-support and its resulting denoised signal. Top: The empirical probability of success in recovering the true co-support
for the thresholding algorithm on each of the dictionary types. Bottom: The noise attenuation performance obtained for the thresholding algorithm on each of
the dictionary types. These are compared with the oracle result, where denoising is obtained by projection onto the correct analysis subspace (knowing the
true co-support of the signals).



A. Theoretical Guarantees for Analysis Thresholding 

Before we turn to the development of the theoretical guarantees
for the analysis thresholding algorithm, we would like
to set some basic assumptions and notations. First, we assume
that all the rows in $\Omega$ have unit norm. Secondly, we denote
an index set of $d - r$ linearly independent rows taken from $\Lambda$
by $\tilde{\Lambda} \subseteq \Lambda$, namely $\operatorname{Span}\{\Omega_{\tilde{\Lambda}}\} = \operatorname{Span}\{\Omega_\Lambda\}$. Finally, given a
noise-free signal $x$ and an analysis dictionary $\Omega$, let us define

$$z_{\min} = \min_{j \in \Lambda^c} |w_j^T x|, \tag{12}$$

where $\Lambda$ is the co-support of $\Omega x$ and $\Lambda^c$ is the complementary
index set. For the co-sparse analysis signal $x$ we have that
$\Omega_\Lambda x = 0$, and since $\Lambda$ contains all the zeros of $\Omega x$, every
entry of $\Omega_{\Lambda^c} x$ is non-zero. The value of $z_{\min}$ is the
smallest of those non-zero inner products with $\Omega_{\Lambda^c}$, and it
plays a major role in the ability of the thresholding algorithm
to tell the right co-support rows from the rest in the noisy
case. We begin our performance study of this algorithm with
a sufficient condition on $z_{\min}$ for success.

Lemma 1. Let $y = x + e$, where $x$ is a co-sparse analysis
signal with co-support $\Lambda$ on $\Omega$. If $x$ and $\Omega$ satisfy
$z_{\min} > 2\max_{j \in \tilde{\Lambda} \cup \Lambda^c} |w_j^T e|$, then the thresholding algorithm
succeeds in recovering the true co-support $\Lambda$ of $x$ from $y$.

Proof: We begin with the simple observation that the
thresholding algorithm succeeds in recovering the true co-support
$\Lambda$ of $x$ when

$$\max_{j \in \tilde{\Lambda}} |w_j^T y| < \min_{j \in \Lambda^c} |w_j^T y|, \tag{13}$$

since in this case all the rows accumulated until the target co-rank
is reached belong to $\Lambda$. Since $w_j^T x = 0$ for all $j \in \Lambda$, the
left-hand side of (13) translates to

$$\max_{j \in \tilde{\Lambda}} |w_j^T y| = \max_{j \in \tilde{\Lambda}} |w_j^T e|. \tag{14}$$

For the right-hand side of (13) we derive a lower bound

$$\min_{j \in \Lambda^c} |w_j^T y| \ge \min_{j \in \Lambda^c} \left(|w_j^T x| - |w_j^T e|\right) \ge z_{\min} - \max_{j \in \Lambda^c} |w_j^T e|, \tag{15}$$

where the first inequality holds from the triangle inequality
and the second holds from the properties of the minimum and
maximum operators,

$$\min(f - g) \ge \min f + \min(-g) = \min f - \max g. \tag{16}$$

From (13)-(15) we get that a sufficient condition for success
of the thresholding algorithm is

$$\max_{j \in \tilde{\Lambda}} |w_j^T e| < z_{\min} - \max_{j \in \Lambda^c} |w_j^T e|, \tag{17}$$

which can be replaced by the sufficient condition

$$z_{\min} > 2\max_{j \in \tilde{\Lambda} \cup \Lambda^c} |w_j^T e|, \tag{18}$$

since

$$2\max_{j \in \tilde{\Lambda} \cup \Lambda^c} |w_j^T e| \ge \max_{j \in \tilde{\Lambda}} |w_j^T e| + \max_{j \in \Lambda^c} |w_j^T e|. \tag{19}$$

Note that so far we have made no specific assumptions on
the signal generative model or the noise. The only assumption
is on the inner products between the signal $x$ and rows in
$\Omega$ that are not indexed in the true co-support. An immediate
observation arising from the above lemma appears in the
following corollary. Using the Cauchy-Schwarz inequality and
the fact that all rows in $\Omega$ are normalized, we get that
$|w_j^T e| \le \|e\|_2$. Thus,

Corollary 1. Let $y = x + e$, where $x$ is a co-sparse analysis
signal with co-support $\Lambda$ on $\Omega$ and $\|e\|_2 \le \varepsilon$. If $x$ and $\Omega$
satisfy $z_{\min} > 2\varepsilon$, then the thresholding algorithm succeeds
in recovering the true co-support $\Lambda$ of $x$ from $y$.
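
The sufficient condition of Corollary 1 can be checked numerically. The sketch below assumes unit-norm rows and a known co-support; the helper names `z_min`, `corollary1_holds` and `ordering_is_correct` are hypothetical. The ordering test is applied to the whole co-support, which is slightly stronger than what the proof requires.

```python
import numpy as np

def z_min(omega, x, cosupport):
    """Smallest |w_j^T x| over the rows outside the co-support."""
    outside = np.setdiff1d(np.arange(omega.shape[0]), cosupport)
    return np.min(np.abs(omega[outside] @ x))

def corollary1_holds(omega, x, cosupport, eps):
    """Sufficient condition z_min > 2*eps for noise with ||e||_2 <= eps."""
    return z_min(omega, x, cosupport) > 2 * eps

def ordering_is_correct(omega, y, cosupport):
    """True if every co-support row has a smaller |w_j^T y| than every other row."""
    z = np.abs(omega @ y)
    outside = np.setdiff1d(np.arange(omega.shape[0]), cosupport)
    return z[cosupport].max() < z[outside].min()

rng = np.random.default_rng(3)
omega = rng.standard_normal((18, 9))
omega /= np.linalg.norm(omega, axis=1, keepdims=True)   # unit-norm rows
cosupport = np.arange(7)
sub = omega[cosupport]
x = rng.standard_normal(9)
x -= np.linalg.pinv(sub) @ (sub @ x)                     # co-sparse signal
eps = z_min(omega, x, cosupport) / 3                     # noise budget below z_min / 2
e = rng.standard_normal(9)
e *= eps / np.linalg.norm(e)                             # noise with ||e||_2 = eps
print(corollary1_holds(omega, x, cosupport, eps), ordering_is_correct(omega, x + e, cosupport))
```

Scaling the noise to one third of $z_{\min}$ guarantees that the condition holds for any noise direction, which is exactly the worst-case flavour of this bound.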

Note that we have referred to the noise as deterministic and 
bounded. This results in a very pessimistic success condition, 
as should be expected for a worst-case performance analysis 
like the one practiced here, in which an estimator must perform 
well even when the noise maximally damages the measure- 
ments (the noise in this case is thus called adversarial). This 
should remind the reader of the theoretical guarantees derived 



for synthesis-based pursuit algorithms under adversarial noise 

m, m, 0, 0, 0. 

To improve the theoretical guarantees, we turn to a setup 
where the noise is assumed to be random. Specifically, we 
assume white and zero-mean Gaussian noise with variance 
(7^, and derive a lower bound on the probability of success 
under a sufficient condition on Zmm- 

Theorem 1. Let y = x + e and e ^ N (O, a^t). If x is a co- 
sparse analysis signal with co-support A on 17, co-sparsity 
i, and co-rank d ~ r, and Jl and x satisfy z,n.in > P<^, 
then the thresholding algorithm succeeds in recovering the 
true co-support K of x from y with probability at least 

Max {o,l-yg;exp{-f}}J 

Before turning to prove this result, a short discussion is in
order. This theorem provides a lower bound on the conditional
probability of success given that $z_{\min} > \beta\sigma$. The derived
expression has an exponential form with a base in the range
[0, 1] depending on $\beta$ and a power $p - \ell + d - r$. In the rest
of the paper we will denote the base of this exponential form
by

$$g(\beta) = \max\left\{0,\; 1 - \sqrt{\frac{8}{\pi}}\,\frac{1}{\beta}\exp\left\{-\frac{\beta^2}{8}\right\}\right\}. \tag{20}$$
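As a quick numerical illustration, the following sketch evaluates the base $g(\beta)$ and the resulting bound of Theorem 1 for the setup $d = 9$, $p = 18$, $r = 2$, $\ell = 14$ used in the text. The closed form coded here is our reading of Eq. (20), namely $\max\{0, 1 - \sqrt{8/\pi}\,\exp\{-\beta^2/8\}/\beta\}$; the function names are illustrative and not part of the original work.

```python
import math


def g(beta):
    """Base of the bound, read from Eq. (20):
    max{0, 1 - sqrt(8/pi) * exp(-beta^2 / 8) / beta}."""
    if beta <= 0:
        return 0.0
    tail_term = math.sqrt(8.0 / math.pi) * math.exp(-beta ** 2 / 8.0) / beta
    return max(0.0, 1.0 - tail_term)


def theorem1_bound(beta, p, ell, d, r):
    """Lower bound on Pr{success | z_min > beta * sigma}: g(beta)^(p - ell + d - r)."""
    return g(beta) ** (p - ell + d - r)


if __name__ == "__main__":
    # Setup from the text: d = 9, p = 18, r = 2, ell = 14.
    for beta in (1, 2, 3, 4, 6, 8):
        print(beta, round(theorem1_bound(beta, p=18, ell=14, d=9, r=2), 4))
```

For small $\beta$ the argument of the maximum is negative and the bound is the trivial zero; at $\beta = 6$ the bound is already close to one, in line with the discussion of Fig. 5.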



The observant reader might ask at this stage: Why is the
performance guarantee of Theorem 1 better than the result
of Corollary 1? To answer this question we explore the
dependence of this performance guarantee on $\beta$. The bound
on this probability increases exponentially from zero to one
as $\beta$ grows, but at the same time the condition on $z_{\min}$
becomes stricter. This bound is shown in Fig. 5 for a setup
with $d = 9$, $p = 18$, $r = 2$ and $\ell = 14$. First, we can
see that the exact co-support is recovered with overwhelming
probability (i.e. near one) for $z_{\min} > 6\sigma$. This aligns with
the guarantee of Corollary 1 requiring $z_{\min} > 2\epsilon$, where $\epsilon$ is
of order $\sqrt{d}\,\sigma = 3\sigma$. More importantly, Theorem 1 provides
probabilistic success guarantees for weaker conditions on
$z_{\min}$, for which Corollary 1 cannot make any guarantee.

Next, we explore the dependence of the obtained lower
bound on the number of atoms $p$, the co-sparsity $\ell$ and
the co-rank $d - r$. Clearly, the probability of success of the
thresholding algorithm improves (grows) when $p - \ell + d - r$
gets smaller. Such is the case, for example, when the dictionary
size $(p, d)$ is kept fixed, the co-rank $d - r$ is chosen as well,
and the level of dependencies, as depicted in $\ell$, grows. This
manifests the surprising fact that strong linear dependencies
within $\Omega$ lead to better performance. Adopting a different point
of view, when $p$ (the dictionary's redundancy) grows, the level
of performance may remain the same as long as $\ell$ grows with
it such that their difference remains unchanged.
Proof: Let us first define the event

$$B = \left\{ \max_{j \in \tilde{\Lambda} \cup \Lambda^C} |w_j^T e| < \tau \right\}. \tag{21}$$



A similar event was defined in ifTTl when developing success 
guarantees for the synthesis-based thresholding and OMP 




Figure 5. The dependence on $\beta$ of the lower bound on the conditional
probability of success given that $z_{\min} > \beta\sigma$ (see Theorem 1) for a setup
with $d = 9$, $p = 18$, $r = 2$ and $\ell = 14$.



algorithms. We start by deriving a lower bound on the proba- 
bility of this event: 

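$$\Pr\{B\} \ge \prod_{j \in \tilde{\Lambda} \cup \Lambda^C} \Pr\{|w_j^T e| < \tau\}
= \left[1 - 2Q\left(\frac{\tau}{\sigma}\right)\right]^{p-\ell+d-r}
\ge \left[1 - \sqrt{\frac{2}{\pi}}\,\frac{\sigma}{\tau}\exp\left\{-\frac{\tau^2}{2\sigma^2}\right\}\right]^{p-\ell+d-r}, \tag{22}$$

where $Q(\cdot)$ is the Gaussian distribution tail,

$$Q(t) = \frac{1}{\sqrt{2\pi}} \int_t^{\infty} \exp\left\{-\frac{z^2}{2}\right\} dz. \tag{23}$$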



The first inequality holds due to Sidak's lemma [27] for a set
of jointly Gaussian random variables. The next equality holds
due to the fact that $\tilde{\Lambda}$ and $\Lambda^C$ are disjoint sets of sizes $d - r$ and
$p - \ell$ respectively. In the last inequality we use a well-known
upper bound on the Gaussian distribution tail,

$$Q(t) \le \frac{1}{t\sqrt{2\pi}} \exp\left\{-\frac{t^2}{2}\right\}. \tag{24}$$



We set $\tau = \frac{1}{2}\beta\sigma$, and thus the event $B$ corresponds to
all the noise vectors $e$ satisfying $2\max_{j \in \tilde{\Lambda} \cup \Lambda^C} |w_j^T e| < \beta\sigma$.
Therefore, if $z_{\min} > \beta\sigma$ as this theorem states, then necessarily
$z_{\min}$ also satisfies the condition of Lemma 1, namely
$z_{\min} > \beta\sigma > 2\max_{j \in \tilde{\Lambda} \cup \Lambda^C} |w_j^T e|$, which guarantees the
success of the analysis thresholding algorithm. The probability
for this to happen is bounded from below by the expression
we have derived in Eq. (22), as claimed.¹

■
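For completeness, the substitution $\tau = \frac{1}{2}\beta\sigma$ that turns the last line of Eq. (22) into the base $g(\beta)$ of Theorem 1 can be written out explicitly. This is a worked step that we add for the reader; it uses only Eq. (22) and the tail bound of Eq. (24).

```latex
\begin{align*}
2Q\!\left(\frac{\tau}{\sigma}\right)
  &\le \sqrt{\frac{2}{\pi}}\,\frac{\sigma}{\tau}
       \exp\!\left\{-\frac{\tau^2}{2\sigma^2}\right\}
  && \text{(Eq. (24) with } t = \tau/\sigma\text{)} \\
  &= \sqrt{\frac{2}{\pi}}\,\frac{2}{\beta}
       \exp\!\left\{-\frac{\beta^2}{8}\right\}
   = \sqrt{\frac{8}{\pi}}\,\frac{1}{\beta}
       \exp\!\left\{-\frac{\beta^2}{8}\right\}
  && \text{(with } \tau = \tfrac{1}{2}\beta\sigma\text{)}
\end{align*}
% Hence 1 - 2Q(\tau/\sigma) \ge g(\beta), and
% \Pr\{B\} \ge [g(\beta)]^{p-\ell+d-r}.
```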

Next, we would like to eliminate the dependence on $z_{\min}$
and derive a theoretical guarantee in terms of the analysis
subspace dimension $r$, the co-sparsity $\ell$ and possibly some
internal properties of the dictionary $\Omega$. This will help to
reveal what makes an analysis dictionary more suitable for
co-sparse estimation. To initiate such an analysis, we make an
additional assumption on the signal generative model. Given
a dictionary $\Omega$, a co-support $\Lambda$ and a random Gaussian vector

¹For values of $\beta$ that lead to a negative argument in this expression we
replace Eq. (22) by a trivial zero lower bound on the probability.



$u \sim N(0, \sigma_u^2 I)$, $x$ is generated by projecting $u$ onto the
subspace orthogonal to $\Omega_\Lambda$, as described in Section III-C
(see (9)). We further assume that $u$ and $e$ are statistically
independent. Using this generative model for $x$, we shall
derive a theoretical guarantee for success of the thresholding
algorithm, based on a new property of $\Omega$ we shall refer to as
ROPP:

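Definition 1. Given an analysis dictionary $\Omega$, the Restricted
Orthogonal Projection Property (ROPP) of this dictionary with
a constant $\alpha_r$ is defined as

$$\alpha_r = \min_{\Lambda,\, j} \left\{ \|(I - \Omega_\Lambda^\dagger \Omega_\Lambda) w_j\|_2 \;:\; \mathrm{Rank}\{\Omega_\Lambda\} = d - r,\; j \in \Lambda^C \right\}. \tag{25}$$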
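To make the definition concrete, here is a minimal brute-force sketch of how the ROPP constant could be computed for a small dictionary. It rests on our reading of Eq. (25): every set of $d - r$ linearly independent rows is enumerated, and the residual norm is taken over all rows lying outside their span (the projection depends only on the span, so this matches a minimization over co-supports of co-rank $d - r$ and $j \in \Lambda^C$). The helper names and the tolerance `tol` are hypothetical; Gram-Schmidt is used in place of an explicit pseudo-inverse, and the search is exponential in $p$, so it is only meant for toy sizes.

```python
import itertools
import math


def orthonormal_basis(rows, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given rows."""
    basis = []
    for v in rows:
        w = list(v)
        for b in basis:
            c = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # keep only directions not already spanned
            basis.append([wi / norm for wi in w])
    return basis


def residual_norm(w, basis):
    """||(I - P) w||_2, where P projects orthogonally onto span(basis)."""
    res = list(w)
    for b in basis:
        c = sum(ri * bi for ri, bi in zip(res, b))
        res = [ri - c * bi for ri, bi in zip(res, b)]
    return math.sqrt(sum(ri * ri for ri in res))


def ropp_constant(omega, r, tol=1e-10):
    """Brute-force ROPP constant for a dictionary given as a list of unit-norm rows."""
    d = len(omega[0])
    k = d - r  # co-rank
    best = math.inf
    for subset in itertools.combinations(range(len(omega)), k):
        basis = orthonormal_basis([omega[i] for i in subset], tol)
        if len(basis) < k:  # the chosen rows are linearly dependent
            continue
        for w in omega:
            res = residual_norm(w, basis)
            if res > tol:  # skip rows inside the span (they belong to the co-support)
                best = min(best, res)
    return best


if __name__ == "__main__":
    s = math.sqrt(0.5)
    print(ropp_constant([[1.0, 0.0], [0.0, 1.0], [s, s]], r=1))
```

For a unitary dictionary this sketch returns 1 for every $r$, in agreement with the remark made in Section IV-B.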



More on the meaning of this constant is brought in Section
IV-B. Armed with this definition, we now turn to improve
Theorem 1 by removing the dependency on $z_{\min}$.

Theorem 2. Let $y = x + e$, where $u \sim N(0, \sigma_u^2 I)$, $x$ is a
co-sparse analysis signal with co-support $\Lambda$ on $\Omega$, obtained
by $x = (I - \Omega_\Lambda^\dagger \Omega_\Lambda)u$, and $e \sim N(0, \sigma^2 I)$ is the additive
noise statistically independent of $u$. If $\Omega$ satisfies the ROPP
with a constant $\alpha_r$ and $x$ has co-rank $d - r$ and co-sparsity $\ell$
on $\Omega$, then the thresholding algorithm succeeds in recovering
the true co-support $\Lambda$ of $x$ from $y$ with probability at least

$$[g(\beta)]^{p-\ell+d-r} \left[2Q\left(\frac{\beta\sigma}{\alpha_r \sigma_u}\right)\right]^{p-\ell} \quad \text{for any constant } \beta > 0.$$

Note that the function $g(\cdot)$ appearing in this theorem is defined
in Eq. (20) and $Q(\cdot)$ is the Gaussian distribution tail (see Eq.
(23)).
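The bound of Theorem 2 is easy to evaluate numerically. The sketch below codes it as the product $[g(\beta)]^{p-\ell+d-r}\,[2Q(\beta\sigma/(\alpha_r\sigma_u))]^{p-\ell}$, which is our reading of the theorem, and searches for the best $\beta$ over a grid. The grid step of 0.5 and all function names are assumptions made for illustration; the parameter values are those of the setup discussed in the text ($d = 9$, $p = 18$, $r = 2$, $\ell = 14$, $\alpha_r = 0.75$, $\sigma/\sigma_u = 0.01$).

```python
import math


def q_tail(t):
    """Gaussian tail Q(t) = Pr{N(0, 1) > t}, as in Eq. (23)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def g(beta):
    """Base of the conditional bound, as read from Eq. (20)."""
    if beta <= 0:
        return 0.0
    return max(0.0, 1.0 - math.sqrt(8.0 / math.pi) * math.exp(-beta ** 2 / 8.0) / beta)


def theorem2_bound(beta, p, ell, d, r, noise_ratio, alpha_r):
    """Product of the conditional bound (Theorem 1) and the bound on Pr{z_min > beta*sigma}.

    noise_ratio stands for sigma / sigma_u.
    """
    conditional = g(beta) ** (p - ell + d - r)
    condition_holds = (2.0 * q_tail(beta * noise_ratio / alpha_r)) ** (p - ell)
    return conditional * condition_holds


def best_beta(p, ell, d, r, noise_ratio, alpha_r, grid):
    """Grid value of beta that maximizes the lower bound."""
    return max(grid, key=lambda b: theorem2_bound(b, p, ell, d, r, noise_ratio, alpha_r))


if __name__ == "__main__":
    grid = [0.5 * k for k in range(1, 41)]  # beta in {0.5, 1.0, ..., 20.0}
    b = best_beta(18, 14, 9, 2, 0.01, 0.75, grid)
    print(b, round(theorem2_bound(b, 18, 14, 9, 2, 0.01, 0.75), 3))  # prints 6.0 0.744
```

On this grid the maximizer is $\beta = 6$ with a bound of about 0.744, matching the values quoted for Fig. 6; a finer grid may place the maximizer slightly above 6.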

Just as we did for the conditional probability of success
of Theorem 1, we start by exploring the dependence of the
resulting bound on $\beta$. This is shown in Fig. 6 for
a setup with $d = 9$, $p = 18$, $r = 2$, $\ell = 14$ (same as before,
see Fig. 5), $\alpha_r = 0.75$ and $\sigma/\sigma_u = 0.01$. We can see that the
choice of $\beta$ is crucial for the tightness of the resulting lower
bound on the probability of success. For the setup considered
here the optimal value of $\beta$ is 6, which results in a lower
bound of 0.744. The lower bound appearing in this theorem is
a product of two exponential terms. The first is the bound on
the conditional probability that appeared in Theorem 1 and the
second term is a bound on the probability that the condition
$z_{\min} > \beta\sigma$ holds (this bound will be derived in the proof
that follows). The first term grows with $\beta$, while the second
decreases, thus explaining the peak between zero and infinity.

Next, we explore the dependence of the obtained lower
bound on the number of atoms $p$ and the co-sparsity $\ell$,
fixing the noise ratio $\sigma/\sigma_u$, the signal dimension $d$ and
the analysis subspace dimension $r$, and assuming that the
dictionary satisfies the ROPP with a constant $\alpha_r$. Since both
the bases of the exponential terms are in the range [0, 1], we
can see that the probability of success of the thresholding
algorithm improves when the difference $p - \ell$ becomes smaller.
This means that the same observations made before on $p$ and
$\ell$ for the conditional probability also hold here: For a given
dictionary of size $(p, d)$ performance improves as $\ell$ grows,
and when the redundancy of the dictionary is increased the
performance remains the same as long as the difference $p - \ell$
remains unchanged. Finally, we observe that since $Q(\cdot)$ is
monotonic decreasing, the performance improves as the noise
ratio $\sigma/\sigma_u$ decreases or the ROPP constant $\alpha_r$ grows.

Figure 6. The dependence on $\beta$ of the lower bound on the probability of
success of Theorem 2 for a setup with $d = 9$, $p = 18$, $r = 2$, $\ell = 14$,
$\sigma/\sigma_u = 0.01$ and $\alpha_r = 0.75$. For this setup the optimal value of $\beta$ is 6,
which results in a lower bound of 0.744 on the probability of success. For
each value of $\beta$ we also show the lower bounds on the conditional probability
of success of Theorem 1 and on the probability that the condition $z_{\min} > \beta\sigma$
holds (see Eq. (26)). The final bound of Theorem 2 is a product of these two
bounds.

Proof: We begin by observing that a signal $x$ generated
as an orthogonal projection of a Gaussian i.i.d. vector $u$ is
also Gaussian, $x \sim N(0, \sigma_u^2 (I - \Omega_\Lambda^\dagger \Omega_\Lambda))$, and so is any
inner product with $x$, $w_j^T x \sim N(0, \|(I - \Omega_\Lambda^\dagger \Omega_\Lambda) w_j\|_2^2 \sigma_u^2)$.
Using this observation, we now derive a lower bound on the
probability that the condition for success of Theorem 1 holds:

$$\Pr\{z_{\min} > \beta\sigma\} = \Pr\left\{\min_{j \in \Lambda^C} |w_j^T x| > \beta\sigma\right\}
\ge \prod_{j \in \Lambda^C} \Pr\{|w_j^T x| > \beta\sigma\}
= \prod_{j \in \Lambda^C} 2Q\left(\frac{\beta\sigma}{\|(I - \Omega_\Lambda^\dagger \Omega_\Lambda) w_j\|_2\, \sigma_u}\right)
\ge \left[2Q\left(\frac{\beta\sigma}{\alpha_r \sigma_u}\right)\right]^{p-\ell}. \tag{26}$$

The first inequality relies on Sidak's lemma, as before.² In the
next equality we use the fact that $w_j^T x$ is Gaussian with the
variance mentioned above. The last inequality holds from the
definition of the ROPP in (25) and since $Q(\cdot)$ is monotonic
decreasing. The power $p - \ell$ comes from the cardinality of the
set $\Lambda^C$.

Combining Theorem 1 and Eq. (26) we get that the final
lower bound on the probability of success for the thresholding
algorithm is a direct multiplication of the two probability
expressions, leading to the claimed lower-bound probability
posed in terms of the ROPP constant $\alpha_r$ and the co-sparsity
$\ell$. ■

²In fact, we are not explicitly using Sidak's lemma, but a related
inequality resulting from this lemma. Let $\{v_j\}_{j=1}^{M}$ be a set of
jointly Gaussian random variables. Then according to Sidak's lemma,
$\Pr\{\max_{1 \le j \le M} |v_j| < \tau\} \ge \prod_{j=1}^{M} \Pr\{|v_j| < \tau\}$. Thus, turning
to our expression, we observe that $\Pr\{\min_{1 \le j \le M} |v_j| > \tau\} =
\Pr\{-\max_{1 \le j \le M} (-|v_j|) > \tau\} = \Pr\{\max_{1 \le j \le M} (-|v_j|) < -\tau\} \ge
\prod_{j=1}^{M} \Pr\{-|v_j| < -\tau\} = \prod_{j=1}^{M} \Pr\{|v_j| > \tau\}$, leading to the relation we
used.



B. Discussion on the Properties of the Analysis Dictionary 

We begin this subsection by taking a closer look at the
ROPP. This is an internal property of the analysis dictionary,
indicating for a set of $d - r + 1$ linearly independent rows from
the dictionary how much each row is spread away from the
subspace spanned by the rest. In the special case of a unitary
dictionary $\Omega$ we have $\alpha_r = 1$ for all values of $r$ since each
row is orthogonal to the subspace spanned by every possible
set of rows not including it. How does the ROPP compare to
other dictionary properties? Starting with the RIP ||6], Q, 



$$(1 - \delta_k)\|v\|_2^2 \le \|Dv\|_2^2 \le (1 + \delta_k)\|v\|_2^2, \tag{27}$$

which holds for all $k$-sparse vectors $v \in \mathbb{R}^n$, the ROPP
also bounds an $\ell_2$ norm related to the dictionary. However,
the ROPP looks at projection matrices constructed from the
dictionary instead of the dictionary itself as in the RIP, and
applies these matrices on dictionary atoms not used for the
matrix construction instead of looking at all possible signals
with a certain sparsity as in the RIP. This should remind the
reader of the ERC [5], which has a similar flavor. Turning to
the ERC [5], for a better comparison let us replace the ROPP
by the sufficient condition



$$\max_{\Lambda,\, j} \|\Omega_\Lambda^\dagger \Omega_\Lambda w_j\|_2 \le 1 - \alpha_r \tag{28}$$

for the same co-supports $\Lambda$ as in (25). To see that this is indeed
a sufficient condition, we assume that (28) holds and show that

$$\|(I - \Omega_\Lambda^\dagger \Omega_\Lambda) w_j\|_2 \ge \|w_j\|_2 - \|\Omega_\Lambda^\dagger \Omega_\Lambda w_j\|_2 \ge \alpha_r, \tag{29}$$

where in the first inequality we used the well-known relation
$\|v_1 - v_2\|_2 \ge \big|\, \|v_1\|_2 - \|v_2\|_2 \,\big|$, which holds for any pair of
vectors $v_1, v_2$, and in the second inequality we used the fact
that $\|w_j\|_2 = 1$ and the assumption of (28). The condition
appearing in (28) has a similar structure to the ERC,

$$\max_{j \notin S} \|D_S^\dagger d_j\|_1 < 1. \tag{30}$$



However, there are two inherent differences: The pseudoinverse
of the submatrix $D_S$ is replaced by a projection matrix
onto the row space of $\Omega_\Lambda$ and the $\ell_1$ norm is replaced by $\ell_2$.
Consequently, an upper bound of 1 is a trivial one and it is
replaced by the stricter bound $1 - \alpha_r$ for some constant $\alpha_r$.

Next, we turn to the theoretical guarantee of Theorem 2 and
observe that it gives rise to two dictionary properties, which
serve as two distinct forces dictating the ability to recover the
co-supports of analysis signals over the given dictionary. The
first property, emanating from the signature or the co-sparsity
of $\Omega$, determines which sets of rows and how many of them
are linearly dependent. However, this measure by itself does
not provide us with any quantitative relation between these
sets and the rows that are linearly independent of them. The
second property focuses exactly on these missing relations,
telling us how much a row is spread away from the others,
provided that it is linearly independent of them.

Are these two dictionary properties somehow related to each
other? To provide an answer to this question we explore the
joint distribution of the two. For this purpose, we replace $\alpha_r$
by $\alpha_r^\Lambda$, which has a similar definition, apart from a delicate
modification: It should satisfy (25) for a single co-support $\Lambda$
corresponding to a co-rank $d - r$, rather than for all possible
co-supports leading to this co-rank, as in the definition of $\alpha_r$
(see Definition 1). This means that $\alpha_r$ can be obtained by
taking the minimal value of $\alpha_r^\Lambda$ over all of these co-supports.
Since $\alpha_r^\Lambda$ is a continuous measure in the range [0, 1], and
since we are about to create histograms of possible values, we
perform a uniform quantization of $\alpha_r^\Lambda$ to $T = 100$ discrete
levels. The joint distribution of $\ell$ and $\alpha_r^\Lambda$ is represented by a
$p$-by-$T$ matrix with entries

Figure 7. The joint distribution of $\ell$ and $\alpha_r^\Lambda$ for each type of the analysis
dictionaries of size 18 × 9 that were shown in Fig. 1 and for $r = 2$. Each of
these distributions is obtained by an exhaustive computation over all possible
subsets of rows from the analysis dictionary with co-rank 7, and is displayed
in the form of a matrix $P^{(r)}$, whose entries are defined in Eq. (31). A
darker bin corresponds to a higher value in the joint distribution.



$$P_{k,m}^{(r)} = \Pr\left\{\ell = k,\; \frac{m-1}{T} < \alpha_r^\Lambda \le \frac{m}{T}\right\}. \tag{31}$$



Obtaining the entries of the matrix $P^{(r)}$ requires an exhaustive
computation over all possible co-supports with co-rank $d - r$.
The joint distributions for the three dictionaries (shown in
Fig. 1) and a co-rank of 7 (i.e. $r = 2$) are depicted in Fig.
7. We can see that increasing the co-sparsity level typically
spreads $\alpha_r^\Lambda$ towards higher values. This makes sense since
the minimization appearing in (25) is performed over smaller
index sets.

Figure 8. The values of the ROPP constant for each type of the analysis
dictionaries of size 18 × 9 that were shown in Fig. 1 and for varying analysis
subspace dimensions $r$. Each of these values is obtained by an exhaustive
minimization over all possible subsets of rows from the analysis dictionary
with co-rank $9 - r$.

C. Results of the Analysis Thresholding Revisited 

We revisit the results shown in Section III-C and try to
explain them in light of the theoretical guarantees derived
in Section IV-A. Note that the setup considered in Theorem
2 (projection of a white Gaussian vector $u$, additive white
Gaussian noise) matches completely the one used for the
experiments of Section III-C. This will allow us to make
the desired connections between the empirical results and the
theoretical guarantee. An immediate observation arising from
Theorem 2 is that the higher the co-sparsity level $\ell$ of $x$ with
respect to $\Omega$, the better the thresholding algorithm is expected
to perform in recovering the true co-support. This implies
that linear dependencies within $\Omega$ are highly desired. This
stands in complete contradiction to the intuition gained for
the synthesis-based sparsity model, where such dependencies
between the atoms lead to a collapse of pursuit algorithms.
We also observe that the results of the analysis thresholding
algorithm improve as $\alpha_r$ grows. This is closer in spirit to the
ERC/RIP rationale, where independencies are encouraged.

Returning to the empirical results of Section III-C, we have
already seen in Fig. 2 that $\Omega_{DIF}$ and $\Omega_{MIX}$ have the same
co-sparsity distribution, where the co-sparsity can be much
higher than the co-rank $d - r$. This can explain, at least in
part, their superior performance over $\Omega_{RAND}$, which allows
only a constant co-sparsity level $\ell = d - r$. We now turn
to examine the value of the ROPP constant for each type
of dictionary, with a hope to reveal an additional inherent
difference between the dictionaries. These values are shown in
Fig. 8 for the three dictionary types and for varying analysis
subspace dimensions $r$. To obtain each of these values we
performed an exhaustive minimization over all possible subsets
$\Lambda$ of rows from $\Omega$ such that $\mathrm{Rank}\{\Omega_\Lambda\} = d - r$. We can
see that $\Omega_{DIF}$ corresponds to a much higher ROPP constant
for all the examined co-ranks, when compared to $\Omega_{MIX}$ and
$\Omega_{RAND}$. The two latter dictionaries have very low ROPP
constants (below 0.14 for $r < 5$). Specifically, at a subspace
dimension of $r = 2$ that was considered in the experiments
of Section III-C, the ROPP constant is 5.6 times higher for
$\Omega_{DIF}$ compared to $\Omega_{MIX}$ and 202(!) times higher compared
to $\Omega_{RAND}$. We can conclude that the value of the ROPP
constant explains the superior behavior of the thresholding
algorithm with $\Omega_{DIF}$ when compared to $\Omega_{MIX}$, as observed
in Fig. 4. This dictionary property also provides additional
grounds for the inferior behavior with $\Omega_{RAND}$.

Next, we turn to examine the theoretical success guarantee
provided in Theorem 2. Fig. 9 (top) displays this lower bound
on the probability of success for the thresholding algorithm
for each of the dictionaries and for varying SNR levels in the
range 6dB to lAdB B To obtain each of the lower bounds 
that are shown in this figure, we find for each co-sparsity $\ell$
and each noise ratio $\sigma/\sigma_u$ a value of $\beta$ such that the lower
bound for the probability of success provided in Theorem 2 is
as tight (i.e. high) as possible. An example of how to choose
an optimal value of $\beta$ was depicted in Fig. 6. Finally, we
perform a weighted average of these lower bounds, where the
weights are simply the values of the co-sparsity distribution.
This process can be described by the following equation:

$$\Pr\{\text{``Success''}\} = \sum_{k=1}^{p} \Pr\{\ell = k\}\,\Pr\{\text{``Success''} \mid \ell = k\}
\ge \sum_{k=1}^{p} \Pr\{\ell = k\}\,[g(\beta_k)]^{p-k+d-r} \left[2Q\left(\frac{\beta_k \sigma}{\alpha_r \sigma_u}\right)\right]^{p-k}, \tag{32}$$

where the function $g(\cdot)$ is defined in Eq. (20) and $\beta_k$ is the
value of $\beta$ that is set for co-sparsity $\ell = k$. These values are
chosen such that the arguments inside the sum are maximized
for each $k$ separately.
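The averaging process just described can be sketched in a few lines. The code below is only an illustration under stated assumptions: the co-sparsity distribution passed as `cosparsity_pmf` is hypothetical example data (not taken from the experiments), the search over $\beta$ uses a grid of step 0.5, and the per-$k$ term is our reading of the summand in Eq. (32). All function names are ours.

```python
import math


def q_tail(t):
    """Gaussian tail Q(t) = Pr{N(0, 1) > t}."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def g(beta):
    """Base of the conditional bound, as read from Eq. (20)."""
    if beta <= 0:
        return 0.0
    return max(0.0, 1.0 - math.sqrt(8.0 / math.pi) * math.exp(-beta ** 2 / 8.0) / beta)


def bound_for_cosparsity(k, p, d, r, noise_ratio, alpha_r, grid):
    """Tightest (largest) lower bound over the beta grid for co-sparsity ell = k."""
    return max(
        g(b) ** (p - k + d - r) * (2.0 * q_tail(b * noise_ratio / alpha_r)) ** (p - k)
        for b in grid
    )


def averaged_bound(cosparsity_pmf, p, d, r, noise_ratio, alpha_r, grid):
    """Weighted average of the per-k bounds, with weights Pr{ell = k}."""
    return sum(
        prob * bound_for_cosparsity(k, p, d, r, noise_ratio, alpha_r, grid)
        for k, prob in cosparsity_pmf.items()
    )


if __name__ == "__main__":
    grid = [0.5 * k for k in range(1, 41)]
    # Hypothetical co-sparsity distribution, for illustration only.
    pmf = {7: 0.2, 10: 0.3, 14: 0.5}
    print(averaged_bound(pmf, p=18, d=9, r=2, noise_ratio=0.01, alpha_r=0.75, grid=grid))
```

Because both bases lie in [0, 1], the per-$k$ bound grows with $k$, so distributions that put more mass on high co-sparsity levels yield a higher averaged bound.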

We can see that the resulting lower bounds can provide
some insight into the actual performance. They are capable of
predicting success with high probability at high SNR levels for
$\Omega_{DIF}$ and $\Omega_{MIX}$. Another useful property of these bounds
is that they clearly predict which dictionary the thresholding
algorithm is expected to perform better with and which would
probably lead to failure. Note that in our quest for theoretical
guarantees we have lost much tightness with respect to the
empirical results. This is typical for a theoretical analysis, but
as we shall see in a moment, the tightness of the derived
bounds can be considerably improved if we take into account
the fact that $\alpha_r^\Lambda$ varies as a function of the co-support, and
has a spread of values. Specifically, we can modify the process
described in Eq. (32) by replacing the distribution of $\ell$ and the
fixed worst-case value of $\alpha_r$ with the joint distribution of $\ell$
and $\alpha_r^\Lambda$, as depicted in Fig. 7. For each such pair and for each
noise ratio $\sigma/\sigma_u$ we set an optimal value of $\beta$ as described
before, and use the values of the joint distribution as weights
for the final average. This means that the process of (32) is
replaced by

$$\Pr\{\text{``Success''}\} = \sum_{k=1}^{p} \sum_{m=1}^{T} P_{k,m}^{(r)}\, \Pr\left\{\text{``Success''} \,\Big|\, \ell = k,\; \frac{m-1}{T} < \alpha_r^\Lambda \le \frac{m}{T}\right\}
\ge \sum_{k=1}^{p} \sum_{m=1}^{T} P_{k,m}^{(r)}\, [g(\beta_{k,m})]^{p-k+d-r} \left[2Q\left(\frac{\beta_{k,m}\, \sigma}{\frac{m-1}{T}\, \sigma_u}\right)\right]^{p-k}. \tag{33}$$

³See Eq. (10) for the definition of SNR and its dependence on $\sigma/\sigma_u$.

Figure 9. Lower bounds on the probability of success for the thresholding algorithm on the three types of analysis dictionaries of size 18 × 9 that were
shown in Fig. 1 and for varying SNR levels. Top: For each ratio $\sigma/\sigma_u$ a lower bound is computed using Eq. (32), where for each co-sparsity level $\ell$ we
choose a value for $\beta$ such that the resulting bound will be as tight as possible. Bottom: For each ratio $\sigma/\sigma_u$ a lower bound is computed using Eq. (33),
where an optimal value for $\beta$ is set for each pair $\ell$, $\alpha_r^\Lambda$. As can be seen, the bounds appearing on the bottom are tighter than those shown on the top.



The resulting lower bounds are shown on the bottom of Fig.
9, and as can be seen, they are much tighter than the previous
ones appearing in this figure on the top.

Before concluding this section, we bring several additional
experiments, this time with higher dimensional signals, in
order to demonstrate the behavior of the thresholding algorithm,
and the comparison between empirical performance and
the theoretical forecasts. We consider signals of dimension
$d = 100$ and three types of analysis dictionaries (same as
before), each with $p = 200$ atoms. We test denoising setups
where the true analysis subspace dimension $r$ varies in the
range [2, 25] and the SNR in the range 6dB to 75dB. For each
pair of $r$ and noise level $\sigma$ we generate $N = 1000$ signals.
When evaluating the theoretical bounds, we cannot use the
value of $\alpha_r$ as an exhaustive search for its value is infeasible. We
therefore use the expression given in Eq. (33), where we plug
into it an empirical distribution of the values of $\ell$ and $\alpha_r^\Lambda$ that
is computed from the signal examples, instead of the exact one
we have used for the low dimensional setups. The empirical
ratios of success and their theoretical lower bounds are shown
in Fig. 10 for the three types of analysis dictionaries of size
200-by-100. Each of these ratios is displayed as a matrix where
white corresponds to one and black corresponds to zero.

Several observations can be made from Fig. 10. First, the
general behavior of the three dictionary types remains as before:
The performance is best for $\Omega_{DIF}$, second best for $\Omega_{MIX}$
and the worst for $\Omega_{RAND}$, both in terms of the empirical
and the theoretical success rates. Secondly, for $\Omega_{DIF}$ and
$\Omega_{MIX}$ the best performance is obtained for low SNR levels
and low subspace dimensions $r$ (the top left corner of the
matrix). This is a desired behavior due to the fact that we
typically want a low subspace dimension, which improves
the denoising performance. For $\Omega_{RAND}$ however, the best
theoretical results are obtained for low SNR levels and high
values of $r$ (the bottom left corner). The theoretical predictions
for this dictionary are less reliable, as we can see that the actual
performance is quite similar for all values of $r$.

V. Relation to Existing Results 

There are several existing contributions in the published
literature on developing pursuit algorithms for the co-sparse
analysis model and studying their performance from a theoretical
standpoint. Here we mention several papers that are
of relevance to this work. We provide a brief review of their
content, followed by a discussion on the relation to our results.

The first work we briefly refer to is [22], which concentrates
on the analysis dictionary learning problem. Two greedy
analysis pursuit algorithms are developed for the denoising
problem, as part of the overall learning paradigm: the
Backward Greedy (BG) and the Optimized
BG (OBG). Both these algorithms are constructed by imitating
synthesis-based pursuit methods, and are brought without a
theoretical justification of any sort. Interestingly, the work in
[22] provides empirical evidence for the positive effect that
strong linear dependencies within the analysis dictionary have
on the success of pursuit algorithms.

The work of [16], [20] considers a noise-free measurement
setup where the co-sparse analysis signal is measured by
$y = Mx$, from which we would like to recover $x$. The authors
of [16], [20] explore various uniqueness properties of this
problem setup and suggest using either an analysis $\ell_1$-norm
minimization or a Greedy-Analysis-Pursuit (GAP) algorithm
(note that GAP is different from the above mentioned BG
and OBG; see more in [22]) for recovering the signal. They
analyze the performance of these pursuit algorithms for the
noise-free setup, deriving a sufficient condition for success
of both algorithms in terms of the analysis dictionary $\Omega$, the
true co-support $\Lambda$ of $x$ and the null-space of $M$. Due to its
apparent similarity to the ERC for the synthesis model, the
derived condition is termed analysis ERC.

Figure 10. Empirical ratios of success and their theoretical lower bounds for the thresholding algorithm on three types of analysis dictionaries of size
200 × 100 for varying analysis subspace dimensions $r$ and SNR levels. For each pair of $r$ and SNR we generate $N = 1000$ signals. The theoretical bounds
are computed using Eq. (33) by plugging into it the empirical distribution of $\ell$ and $\alpha_r^\Lambda$, which is computed from these signals. Left: The empirical ratios of
success. Right: The theoretical bounds.

The theoretical study of analysis $\ell_1$-norm based pursuit
in a measurement setup is also the main focus of another
recent work [21]. This includes the derivation of conditions
for noiseless identifiability and robustness to bounded noise,
in terms of the sign pattern of $\Omega x$ and assuming that the
null spaces of the measurement matrix $M$ and the analysis
dictionary $\Omega$ intersect only at the zero vector. Note that all
of the resulting conditions in [16], [20], [21] are somewhat
implicit, especially in the latter work, where the condition
involves an inner optimization stage for a given sign pattern.
This makes the derived conditions hard to interpret.

A different work altogether is proposed in [13]. The authors
of [13] suggest a hybrid viewpoint to the synthesis and analysis
models, where the signal of interest is a synthesis-and-analysis
signal, constructed as $x = D\alpha$ with a sparse synthesis
representation $\alpha$. However, this signal is also characterized
as an analysis signal in the sense that it has a small $\ell_1$ energy
in the tail of its analysis representation with respect to $D$. They suggest
using an analysis-based approach for recovering the signal
from its undersampled and noisy measurements $y = Mx + e$.
Their approach is based on the $\ell_1$-norm sparsity of this analysis
representation, and they derive a theoretical upper bound on the
denoising error obtained by
$\ell_1$ analysis pursuit in this setup. To obtain the desired bound
they require the measurement matrix $M$ to satisfy a certain
property adapted to $D$, termed D-RIP, which is similar to the
well-known RIP aside from a delicate modification: instead
of bounding the $\ell_2$ norm of $Mv$ for all $k$-sparse vectors $v$, the
norm of $Mv$ is bounded for all vectors $v$ that can be expressed
as a linear combination of $k$ columns of $D$.

The work of [17] suggests a family of new pursuit algorithms
for recovering co-sparse analysis signals from their undersampled
measurements. These algorithms are analogous to
the synthesis-based iterative hard thresholding algorithm, with
a modification of the projection step intended for adapting this
framework to the analysis model. The authors of [17] present
theoretical recovery guarantees for these analysis pursuit algorithms
in the noiseless setup, assuming that the measurement
matrix satisfies the $\Omega$-RIP (an analysis counterpart for the
D-RIP of [13]).

In this paper we focus on a denoising setup, similar to
[22], and assume no measurement matrix. Our focus is the
simplest analysis pursuit algorithm, namely thresholding. This
allows us to remove some of the ambiguities that are present in
previous works, where the resulting theoretical conditions mix
both the measurement matrix $M$ and the analysis dictionary $\Omega$;
we focus on internal properties of $\Omega$ only. Indeed, our derived
theoretical guarantees are expressed in terms of the noise level,
the co-sparsity $\ell$ of the signal over $\Omega$ and internal properties
of $\Omega$. Instead of using dictionary measures that mimic the
synthesis counterpart model, as practiced in [20], which uses
analysis ERC, or [13], [17], which use RIP-like properties,
we suggest a novel measure, termed Restricted Orthogonal
Projection Property (ROPP), which seems to be more relevant
to analysis dictionaries. This property is much more explicit
than the one arising from the theoretical analysis of [21]. Our
derived results are simple to interpret, and specifically we see
that strong linear dependencies improve the pursuit algorithm's
success rate.

VI. Conclusions 

In this work we have made an initial attempt at addressing
the question of what makes an analysis dictionary suitable for
co-sparse estimation. We have concentrated on a denoising
setup and considered the use of a thresholding algorithm for
the corresponding analysis pursuit problem. Our experiments
show that this simple algorithm can perform quite well for
certain analysis dictionaries, while failing on others. To better
understand this behavior we further explored the performance
of this algorithm in the presence of white Gaussian random
noise, developing theoretical guarantees for the ability of the
algorithm to recover the true underlying co-support. This study
reveals two significant properties of an analysis dictionary that
are key in dictating whether the pursuit will succeed or fail:
The degree of linear dependencies between rows of $\Omega$ and
the level of independence between subsets of rows and other
atoms, a property we termed ROPP. We have found that it is
desired to have many linear dependencies, as they increase the
co-sparsity level. Similarly, the ROPP constant should be as
high as possible. Finally, we have shown how the developed
theoretical guarantees can explain our empirical results and
predict them quite well. This work gives rise to various open
questions that will be the topics of future research. These
include:



1) While this work concentrated on the thresholding algorithm,
a similar theoretical study should be given to
other pursuit algorithms. Perhaps the quality measures
we identified in this work could be of help in such a study.

2) This work defines the success of the pursuit algorithm by
the complete identification of the co-support. However,
this algorithm may perform rather well (in denoising
terms) even in situations where only part of the co-support
has been found. Extending this work to cover such cases
would improve our prediction for the range of success
of the thresholding algorithm.

3) How could we incorporate the proposed quality measures
for $\Omega$ directly into the dictionary learning process?
By doing so we may design better analysis dictionaries,
which will ultimately lead to performance improvement
and make the analysis model and its learned dictionary
suitable for a wide range of processing applications.